Overview

oz covid19 visualisations

The time-series line in the heading above is derived from current incidence cases of COVID-19 in Australia.

Introduction

This website is a joint effort of researchers at the South Western Sydney Clinical School and the Centre for Big Data Research in Health at the UNSW Faculty of Medicine, the Econometrics and Business Statistics Research Group of Monash University, and the Ingham Institute for Applied Medical Research in Liverpool, Sydney.

The intent is to offer a range of principled epidemiological and statistical analyses and visualisations of current COVID-19 data which go beyond the now ubiquitous world maps and cumulative incidence charts.

The broad themes for the analyses and visualisations currently available are listed in the menu at the top of this page – more will be added in due course. For each theme there is an introductory page explaining the motivating ideas and methodology employed for each of the visualisations or analyses for that theme, which are available in the subsequent frames (the series of rectangles at the top of each page). Additional notes or commentary appear on the right of some pages.

An Australian focus with an international perspective

This site has been created by researchers at Australian universities, and hence the focus is on the situation in Australia, within the broader international context – we are, after all, all in this together. However, we hope that some of the analyses and visualisations on this site might be useful elsewhere, and to that end, all the R source code used to create this site is freely available – please see the Technical details tab above for details of the software used, and where to find the source code.

Contributors

Creative Commons License
The content on this site is licensed under a Creative Commons Attribution 4.0 International License. If you wish to re-use any of the content, please retain the attribution which appears on each chart or set of charts, or otherwise provide that attribution alongside the content that you re-use.

In turn, we acknowledge RECON for providing financial support, and the European Centre for Disease Prevention and Control (European CDC), which has provided up-to-date data on the COVID-19 pandemic.

The project has been led by Timothy Churches and Nicholas Tierney, with thanks to Stuart Lee, Dianne Cook, Miles McBain, and Rob Hyndman.

The visualisations and analyses are made available in the covidrecon package.

Source code for these visualisations is available at covid-flexdashboard.

Technical Details

The data analysis has been carried out entirely in the R programming language, using RStudio.

Packages used include:

Incidence

Explanation

Incidence is the epidemiological term for the number of cases of a disease meeting some case definition in a specified time period. Here we present the daily incidence – that is, the number of new cases each day – of COVID-19 for a range of countries, including Australia, of course. Each country may be using a slightly different case definition, although most of the countries presented here have been using case definitions aligned with those recommended by the WHO. Three types of case definition are used, simultaneously:

  • Suspected cases
  • Probable cases
  • Confirmed (by laboratory test) cases

All of the data reported here are for confirmed cases, with a few exceptions (China included cases diagnosed via lung CT scan for a few days in February 2020, but we have adjusted the data for those days as far as possible to remove that anomaly).

The data are drawn from the European CDC, which has collected data from various governing agencies around the world. There are some small discrepancies between these data and those given by the Australian government, and indeed other national governments, due to timing issues relating to when cases are tabulated each day, and so on. However, we believe the European CDC to be the best source of automatically downloadable, machine-readable data there is right now.

Note, however, that because the confirmed case definition depends on laboratory confirmation, it is influenced by the number of lab tests (RT-PCR, reverse transcriptase polymerase chain reaction tests on nasopharyngeal swabs or sputum) done by each reporting country or jurisdiction. Obviously the number of reported cases is bounded by the number of such lab tests which are done, but the degree of under-ascertainment is also affected by the policies in place which determine which potential COVID-19 cases are tested. These policies are completely country-specific and have changed over time.

Another issue with these data is that it is vastly preferable to analyse incidence by (presumed or definite) date of onset of symptoms, rather than by date of notification or date of reporting. There may be variable delays in the processing of laboratory tests and the reporting of cases to central authorities. Tabulation of incidence counts by date of onset overcomes this problem. Almost all national and jurisdictional health authorities will be collecting data on presumed date of onset for each case (although, inexplicably, it is not one of the data items on the WHO-recommended data collection form). One reason given for not using date of onset is that it may be incomplete, but this ignores the fact that statistical imputation can be used to validly fill in those missing dates of onset. There is also a body of methods, collectively known as nowcasting, that use multivariate time-series models to convert data tabulated by date of notification/reporting to (estimated) date of onset. If national or jurisdictional health authorities do not have the technical capacity to undertake such value-adding to their own data, they should make the required data available to trusted partners in the academic sector who can undertake such statistical manipulation for them (or help authorities implement such processing internally).

Cumulative Incidence

Cumulative incidence is, as the name suggests, just the cumulative sum of the daily incidence – that is, a running total of the number of cases. Reporting of cumulative COVID-19 incidence seems to dominate the mainstream media, but it has many disadvantages. In particular, a cumulative sum of case counts is always monotonically non-decreasing – it can only ever go up, or at best, remain flat if there are no new incident cases. This tends to obscure the rate of change in incidence over time – subtle, or even large changes in the slope of the cumulative incidence curve are difficult to see.
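As a minimal sketch of the relationship between the two measures, the following Python snippet builds a cumulative series from daily counts and then recovers the daily counts by differencing. The site itself is built in R; Python is used here purely for illustration, and the counts are made-up numbers, not real data.

```python
from itertools import accumulate

# Hypothetical daily counts of new cases (illustrative numbers only).
daily_incidence = [2, 5, 9, 14, 11, 6, 3, 0, 0]

# Cumulative incidence is the running total of the daily incidence.
cumulative_incidence = list(accumulate(daily_incidence))

# Daily incidence can be recovered by differencing the cumulative series.
recovered = [cumulative_incidence[0]] + [
    later - earlier
    for earlier, later in zip(cumulative_incidence, cumulative_incidence[1:])
]

print(cumulative_incidence)  # [2, 7, 16, 30, 41, 47, 50, 50, 50]
```

Note that the daily counts fall from day 4 onwards, yet the cumulative series keeps rising until the daily count reaches zero, at which point it goes flat.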

Semi-log cumulative incidence chart

The first chart presented here is a semi-log cumulative incidence chart. This chart seems to have been popularised by John Burn-Murdoch in the Financial Times, but it appears it was first used by Matt Cowgill from Australia’s very own Grattan Institute. It has since been widely copied and reproduced. Please see a blog post by Prof Rob Hyndman for further discussion of this chart, and some alternative analyses (one of which we also reproduce in the section on time-varying effective reproduction numbers (\(R_{t}\))).

There are several variations on the Grattan chart presented; please see the notes for each one.

Epicurves

The epicurve is perhaps the most-used chart in field epidemiology and outbreak control. It is simply a chart of daily (or weekly, for slower-moving diseases) incidence (new cases), traditionally shown as a bar chart. It gives an immediate sense of whether an outbreak or epidemic is in a growth phase, with increasing incident counts each successive day, or in a decay phase, with decreasing counts each successive day. Note that the cumulative count will still increase, day-on-day, even when an outbreak or epidemic is in a decay phase. Only when the epidemic has been completely extinguished will the cumulative incidence stop increasing. That’s one of the reasons why cumulative incidence charts are rarely used by epidemiologists.

Three variations of epicurves are provided here:

  • the usual epicurve, on a linear y-axis, and with the ranges specific for each country to maximise the amount of detail discernible;
  • the same chart, but with a logarithmic y-axis, which allows the periods with lower counts to be inspected in greater detail;
  • the usual epicurve, on a linear y-axis, but with the same y-axis scale across all countries, which clearly shows the relative incidence in each country. This chart is quite startling.

Semi-log cumulative incidence for selected countries – the Grattan Institute chart


This chart shows the cumulative cases of COVID-19 for selected countries on a logarithmic y-axis scale, with the dates on the x-axis converted to the number of days since each country shown exceeded 100 cumulative cases, on a linear x-axis scale (hence the name semi-log, since only one of the two axes is logarithmic).
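The x-axis alignment described above can be sketched as follows. This is an illustrative Python snippet (the site itself is built in R); the function name `days_since_threshold` and the two country series are hypothetical, and only the alignment step is shown, not the logarithmic y-axis.

```python
def days_since_threshold(cumulative, threshold=100):
    """Return (day_offset, cumulative_count) pairs, with day 0 being the
    first day on which the cumulative count exceeded `threshold`."""
    for start, count in enumerate(cumulative):
        if count > threshold:
            return list(enumerate(cumulative[start:]))
    return []  # the threshold was never exceeded


# Hypothetical cumulative case counts for two countries (not real data).
country_a = [10, 40, 90, 150, 260, 400]
country_b = [5, 20, 60, 101, 130]

aligned_a = days_since_threshold(country_a)
aligned_b = days_since_threshold(country_b)

print(aligned_a)  # [(0, 150), (1, 260), (2, 400)]
print(aligned_b)  # [(0, 101), (1, 130)]
```

After alignment, day 0 means the same thing for every country, so the trajectories can be compared directly even though the countries crossed the threshold on different calendar dates.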

Note that countries “peel off” the diagonal trajectory as their rate of new (incident) cases reduces. If the line for a country is horizontal, it means there are no new cases occurring there.

The curve for Australia is clearly flattening, and we are keeping good company with China, South Korea and New Zealand as the other countries with nearly horizontal trajectories. Note that after considerable initial success in containing COVID-19 spread, both Japan and Singapore are now on an upward trajectory, but the slopes of those trajectories are much shallower than those of the other countries shown, indicating much slower spread in those countries.

Semi-log cumulative incidence for selected countries – aligned to start of epidemic in each country


This chart is a variation on the previous Grattan Institute chart. It isn’t really an improvement on the Grattan chart, but is shown here to illustrate the sensitivity of the Grattan chart to the method used to align the dates on the x-axis. In the chart shown at left, the dates on the x-axis are aligned to the approximate start of the COVID-19 epidemic in each country. The start dates are chosen automatically using a non-parametric changepoint detection algorithm (see the changepoint.np package for R). The changepoints for each country shown are as follows:

Country        Detected start of epidemic
China          2020-01-18
Singapore      2020-01-27
South Korea    2020-01-30
Japan          2020-02-12
Italy          2020-02-21
Spain          2020-02-24
Germany        2020-02-25
France         2020-02-25
USA            2020-02-26
UK             2020-02-27
Australia      2020-03-01
New Zealand    2020-03-02

Semi-log daily incidence for selected countries – aligned to start of epidemic in each country


This is another variation on the Grattan Institute chart, but this time showing daily incidence on the y-axis, rather than daily cumulative incidence. It is basically the same information as shown in the epicurve charts in the subsequent frames (accessed via the blue rectangles at the top of the page), but all of the selected countries are shown on one plot, and the dates are aligned by the detected start of the epidemic in each country.

One problem is that the daily incidence curves are rather noisy. We can address that by smoothing them, as shown in the next frame.
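To illustrate what smoothing does to a noisy series, here is a minimal Python sketch of a centred moving average. This is an assumption made for illustration only: the text does not say which smoother is used for the chart, the site itself is built in R, and the function name and example counts are hypothetical.

```python
def centred_moving_average(counts, window=7):
    """Smooth a series with a centred moving average.

    Near the ends of the series the window is truncated, so the average
    is taken over whichever neighbouring days are available.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centred")
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


# A deliberately spiky, made-up series of daily counts.
noisy = [0, 30, 0, 30, 0]
print(centred_moving_average(noisy, window=3))  # [15.0, 10.0, 20.0, 10.0, 15.0]
```

The smoothed values swing between 10 and 20 rather than between 0 and 30, which is the kind of noise reduction that makes the underlying trend easier to see.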

Semi-log smoothed daily incidence for selected countries – aligned to start of epidemic in each country


In this chart, we can now discern three distinct groups of countries:

  • an upper group of just one, the USA, where the epidemic is still growing rapidly;
  • a middle group comprising the UK and European countries;
  • a lower group, which includes Australia, New Zealand, South Korea, China, Japan and Singapore, which all have their local COVID-19 epidemics under control (but definitely not eradicated). Note however that Japan and Singapore are now exiting that lower group and heading upwards.

Epicurves for selected countries


The charts shown here are epicurves – that is, daily count of incident (new) cases, using data collated by the European CDC.

Note that there are some days where the number of cases appears to spike upwards, followed by a decrease the following day. This indicates that there may be some data discrepancies in how the European CDC is capturing data from WHO Situation Reports. It underlines the importance of nations providing reliable machine-readable access to their own COVID-19 data. By “machine-readable” we mean CSV or JSON data files which are automatically downloadable, or an API which can be queried automatically to yield such data. Neither of those is difficult to establish, yet nearly all national governments have failed to provide such data, leaving it to third-party agencies and citizen-science efforts to piece together the required data in a manner that permits ongoing analysis. There is, for example, no official machine-readable source of national COVID-19 data provided by the Australian government. As far as we are aware, NSW is the only State or Territory government that has made any effort in that direction by providing some machine-readable data, which we leverage in the \(R_{t}\) for NSW theme (see menu above).

Epicurves for selected countries (common y-axis scale)


Forcing the y-axis scale to be the same for all the plots in this chart means that, compared to the previous chart where the y-axis could change for each country, the country with the largest number of cases, in this case the USA, appears the same, and the rest of the plots appear smaller.

This provides important context: the number of cases in the USA currently dominates relative to other countries. The countries in the bottom row are barely visible, by comparison.

Epicurves for selected countries (logarithmic y-axis scale)


The final variation on the epicurve chart, shown here, uses a logarithmic y-axis scale. This emphasises lower counts, which is useful for inspecting the beginnings and ends of the epicurves, or the middle parts where there is more than one “wave”, as in China and Singapore.

National-level \(R_{t}\)

Explanation

To understand the spread of COVID-19, we use a statistic called “R0” (R-nought). This is the number of people we expect to be infected by one person with COVID-19 arriving in a population.

We estimate R0 from the available data; more specifically, we estimate the effective reproduction number (which is different from the basic reproduction number). This reflects the current state of a population, which may include some infections.

What does R0 mean?

If R0 is 1

  • One person arriving in a population could spread the infection to one other person. That person could then spread it to one other person, and so on. This is easier to manage.

If R0 is 0.3

  • For every 10 people infected, the infection would spread to 3 other people.

If R0 is 2

  • Every person arriving in the population can infect 2 more people. This can quickly get out of control.

What do we want R0 to be?

  • We want the R0 to be below 1, and ideally as close to 0 as possible.
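The arithmetic behind the three scenarios above can be sketched in a few lines of Python (the site itself is built in R; this is purely illustrative). The sketch makes the strong simplifying assumption that every infected person infects exactly R others in the next generation; the function name is hypothetical.

```python
def expected_cases_by_generation(r, generations, initial_cases=1):
    """Expected number of new cases in each successive generation of
    infection, assuming each case infects exactly `r` others."""
    return [initial_cases * r ** g for g in range(generations + 1)]


print(expected_cases_by_generation(2, 5))  # [1, 2, 4, 8, 16, 32]
print(expected_cases_by_generation(1, 5))  # [1, 1, 1, 1, 1, 1]

# With R = 0.3 and 10 initial cases, each generation is smaller than the last.
shrinking = expected_cases_by_generation(0.3, 3, initial_cases=10)
```

With R above 1 each generation is larger than the one before, with R equal to 1 the number of cases stays constant, and with R below 1 the outbreak dies away.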

Estimating Effective reproduction number

We estimate the effective reproduction number using mathematical modelling. This means that the estimated numbers are dependent upon the mathematical model, and while they are our best possible guess at R0, there may be some inaccuracies.

These estimates reflect the number of new cases we expect from an infected person as they interact with a population. The higher the number, the worse the control of the infection, and the more people will be infected. We want this number to be below 1, and preferably as close to 0 as possible.

Misconceptions / mistakes

R0 is not a rate, and has no units of time.

This means it is not a rate of infection, and does not tell us how fast an infection spreads.
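A small worked example may help show why the reproduction number alone says nothing about speed. The Python sketch below (illustrative only; the site itself is built in R) assumes, hypothetically, that every infection is passed on after a fixed generation interval, and compares two diseases with the same reproduction number but different intervals. The function name and the intervals of 4 and 8 days are made up for the example.

```python
def cases_after(days, r, generation_interval_days, initial_cases=1):
    """New cases in the generation reached after `days`, assuming each case
    infects `r` others exactly `generation_interval_days` later."""
    generations = days // generation_interval_days
    return initial_cases * r ** generations


# Same reproduction number, different (hypothetical) generation intervals.
fast = cases_after(16, r=2, generation_interval_days=4)
slow = cases_after(16, r=2, generation_interval_days=8)

print(fast)  # 16
print(slow)  # 4
```

Both diseases have a reproduction number of 2, but after 16 days the one with the shorter interval has gone through four generations of infection instead of two.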

Limitations

These estimates, for all countries, do not account for people who bring the infection into the population by travelling. This means that we are assuming all of these infections are spread within the community. However, Australia actually has quite a large proportion of cases in people who have travelled from overseas with COVID-19. This means that these estimates might over-estimate R0. We have explored what is possible if we incorporate these imported cases, where we have the data, for New South Wales.

Selected Countries


The effective R estimates for each country indicate that most countries are bringing the spread of the virus under control. However, there is wide variation. The estimate of effective R for Australia has decreased substantially, now falling below one, indicating that we have now started to control the outbreak.

Per Country


This is the same information presented as in the previous graphic, but with each country split into its own graph. This allows us to see the trajectory of effective R.

Most countries are decreasing, but we notice that Japan and Singapore are fluctuating, due to recent outbreaks.

Per Country (ribbons)


This shows the uncertainty around the estimate of the effective R. We notice that as time goes on, there is less uncertainty around the estimate, so we can be more confident in our estimate at these later times.

Australia


Japan


South Korea


China


Germany


Spain


France


Italy


Singapore


United Kingdom


USA


\(R_{t}\) for NSW

Explanation

The total number of incident cases arising at timestep \(t\), \(I_t\), is the sum of the numbers of incident local (\(I_{t}^{local}\)) and imported (\(I_{t}^{imported}\)) cases,

\[ I_t = I_t^{local} + I_t^{imported} \]

It is assumed that, if imported cases exist, they can be distinguished from local cases, for instance through epidemiological investigations, so that \(I_{t}^{local}\) and \(I_{t}^{imported}\) are observed at each timestep.

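Writing \(w_s\) for the serial interval distribution (the probability that a secondary case arises \(s\) timesteps after its infector), the total infectiousness at timestep \(t\), \(\Lambda_t\), is the sum of past incidence, both local and imported, weighted by \(w_s\):

\[ \Lambda_t(w_s) = \sum_{s=1}^t (I_{t-s}^{local} + I_{t-s}^{imported}) w_s = \sum_{s=1}^t I_{t-s} w_s \]

The expected number of new local cases at timestep \(t\) is then the product of \(R_t\) and that total infectiousness:

\[ \mathbb E(I_t^{local} | I_0, I_1, \ldots, I_{t-1}, w_s, R_t) = R_t\Lambda_t (w_s) \]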
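The following Python sketch works through the two formulas above on made-up numbers (the site itself is built in R; Python is used purely for illustration). It computes \(\Lambda_t\) from the total incidence and a hypothetical three-step weight vector \(w_s\), then divides the local incidence by \(\Lambda_t\) to get a crude point estimate of \(R_t\). That division is just the expectation formula rearranged. It is an assumption made for illustration, not the estimation procedure used for the charts, which also quantifies uncertainty.

```python
def total_infectiousness(incidence, w, t):
    """Lambda_t = sum over s = 1..t of I_{t-s} * w_s.

    `w[s - 1]` holds w_s; weights beyond the end of `w` are taken as zero.
    """
    return sum(
        incidence[t - s] * w[s - 1]
        for s in range(1, t + 1)
        if s <= len(w)
    )


# Hypothetical counts per timestep (not real data).
local = [1, 2, 4, 6]
imported = [3, 2, 1, 0]
total = [l + i for l, i in zip(local, imported)]  # I_t = local + imported

# Hypothetical weights w_1, w_2, w_3 (sum to 1).
w = [0.5, 0.3, 0.2]

t = 3
lambda_t = total_infectiousness(total, w, t)  # 5*0.5 + 4*0.3 + 4*0.2
naive_r_t = local[t] / lambda_t               # crude estimate of R_t

print(round(lambda_t, 3), round(naive_r_t, 3))
```

Imported cases contribute to \(\Lambda_t\) because they can infect others, but they are excluded from the numerator because they were not infected locally. Treating them as local would inflate the estimate.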

NSW – locally-acquired & overseas-acquired cases treated separately
(cases under investigation excluded)
parametric serial interval distribution


Commentary

NSW – locally-acquired & overseas-acquired cases treated separately
(cases under investigation excluded)
serial interval distribution estimated from data


Commentary


NSW – adjusting for potential under-ascertainment, locally-acquired & overseas-acquired cases treated separately, (cases under investigation excluded), parametric serial interval distribution


In these plots, the counts of incident cases with presumed local sources of infection have been inflated by a factor of 10, and the counts of cases with presumed overseas sources of infection inflated by a factor of 1.5. This mimics ten-fold under-ascertainment of locally-transmitted cases, and 50% under-ascertainment of inbound cases.
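The adjustment described above is a simple rescaling of the two series before the reproduction number is estimated. A minimal Python sketch follows (illustrative only; the site itself is built in R, and the function name and counts are hypothetical):

```python
def inflate(counts, factor):
    """Scale observed counts up by `factor` to mimic under-ascertainment."""
    return [count * factor for count in counts]


# Made-up observed counts per timestep.
observed_local = [2, 4, 6]
observed_imported = [4, 2, 0]

adjusted_local = inflate(observed_local, 10)          # [20, 40, 60]
adjusted_imported = inflate(observed_imported, 1.5)   # [6.0, 3.0, 0.0]
```

Because the two series are scaled by different factors, the balance between local and imported cases changes, which is why the adjustment can alter the estimates.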


NSW – adjusting for potential under-ascertainment, locally-acquired & overseas-acquired cases treated separately, (cases under investigation excluded), serial interval distribution estimated from data


In these plots, the counts of incident cases with presumed local sources of infection have been inflated by a factor of 10, and the counts of cases with presumed overseas sources of infection inflated by a factor of 1.5. This mimics ten-fold under-ascertainment of locally-transmitted cases, and 50% under-ascertainment of inbound cases.


NSW – locally-acquired & overseas-acquired cases treated separately
(cases under investigation included)
parametric serial interval distribution


Commentary


NSW – locally-acquired & overseas-acquired cases treated separately
(cases under investigation included)
serial interval distribution estimated from data


Commentary


NSW – all cases treated as locally-acquired
parametric serial interval distribution




NSW – all cases treated as locally-acquired
serial interval distribution estimated from data

